ARTICLE



Genome-wide patterns of identity-by-descent sharing 
in the French Canadian founder population 

Héloïse Gauvin, Claudia Moreau, Jean-François Lefebvre, Catherine Laprise, Hélène Vézina,
Damian Labuda and Marie-Hélène Roy-Gagnon

In genetics the ability to accurately describe the familial relationships among a group of individuals can be very useful. 
Recent statistical tools succeeded in assessing the degree of relatedness up to 6-7 generations with good power using dense 
genome-wide single-nucleotide polymorphism data to estimate the extent of identity-by-descent (IBD) sharing. It is therefore 
important to describe genome-wide patterns of IBD sharing for more remote and complex relatedness between individuals, such 
as that observed in a founder population like Quebec, Canada. Taking advantage of the extended genealogical records of the 
French Canadian founder population, we first compared different tools to identify regions of IBD in order to best describe 
genome-wide IBD sharing and its correlation with genealogical characteristics. Results showed that the extent of IBD sharing 
identified with FastIBD correlates best with relatedness measured using genealogical data. Total length of IBD sharing explained
85% of the genealogical kinship's variance. In addition, we observed significantly higher sharing in pairs of individuals with at 
least one inbred ancestor compared with those without any. Furthermore, patterns of IBD sharing and average sharing were 
different across regional populations, consistent with the settlement history of Quebec. Our results suggest that, as expected, 
the complex relatedness present in founder populations is reflected in patterns of IBD sharing. Using these patterns, it is thus 
possible to gain insight on the types of distant relationships in a sample from a founder population like Quebec.

INTRODUCTION

In genetics research, the ability to accurately describe the familial 
relationships among a group of individuals can be very useful. For 
example, genome-wide association studies generally assume that 
studied subjects are independent and this assumption can be assessed 
easily if the list of their recent ancestors is known and error-free. 
Almost everybody can identify their parents and generally also their
grandparents or even great-grandparents. However, most people do
not know about their ancestors more remote than two or three
generations unless extensive genealogical records are available for the
population studied, as in the cases of the Hutterites, Icelanders or
Amish, for example.

Another way to describe relationships among individuals in a data 
set is to look directly at their genome. Recent statistical tools 
succeeded in assessing the degree of relatedness up to 6-7 generations 
with good power using identity-by-descent (IBD) sharing.
IBD sharing, estimated with genome-wide single-nucleotide
polymorphism (SNP) data, is defined as segments of the genome
shared identically between two individuals. These chromosome
segments are identical-by-state (IBS) and descend from a
common ancestor without occurrence of any recombination event.
A segment IBD is always IBS but the reverse is not necessarily true
unless the time scale is unlimited. In practice, IBD detection from
SNPs captures relatively recent ancestry since the resolution of IBD
segment detection in a specific data set limits the time scale that can
be considered.

Following the important technological innovations that made large 
amounts of genome-wide SNP data available at reasonable costs, 
several methods to detect IBD sharing between individuals have been 
developed. Approaches generally assess the likelihood that a genetic
sequence is IBD in one of three ways: with a probabilistic model
detailing the whole IBD process; with haplotype frequencies, where a
low frequency of a shared haplotype indicates highly probable IBD; or
with a segment length threshold, as a sequence is more likely to be IBD
when it spans a large chromosomal segment. For example, GERMLINE is a
method using a length threshold that builds a dictionary of haplotype
chunks; IBD segments are then identified subject to a minimal length,
with some flexibility to accommodate possible genotyping errors. The
most flexible method is the hidden Markov model (HMM), which
provides a basic framework to which probabilities for genotyping
error and a linkage disequilibrium (LD) model can be added.
Haplotypes or genotypes can be used and some inference methods
also use IBD detection to improve or to perform phasing.
Simulation studies have shown that more complex models had
lower false-discovery rates and higher sensitivity, in particular higher
power to detect small segments, resulting in greater accuracy of IBD
segment detection.

Most comparisons of IBD inference methods have been conducted
in homogeneous, unstructured populations and in simulation frameworks.
In fact, to our knowledge, IBD inference methods have not
been compared in a real-data setting with extensively documented
genealogical records, and genome-wide patterns of IBD sharing have
not been described for remote and complex relatedness such as that 
observed in the French Canadian founder population of the province 
of Quebec, Canada. The history of the French Canadian founder
population begins with French settlers arriving at the beginning of
the 17th century. Immigration from France ceased with the British
Conquest in 1759. From 1755, Acadians, who were descendants of
French pioneers who settled in Acadia (located in areas of present-day
Nova Scotia, New Brunswick and Prince Edward Island), started to
move to several regions of Quebec, escaping the deportation led by
the British. In the last part of the 18th century, American Loyalists,
who wanted to stay under British rule, also moved to Quebec.
Meanwhile, the French Canadian population expanded rapidly in
relative isolation caused by linguistic, religious and geographic
barriers, which amplified the founder effect. As population size
grew, settlers colonized new regions of Quebec, including remote and
isolated regions, which resulted in population structure.

In this study, we focused on three regions: the Saguenay-Lac-St-Jean, 
the western part of the North Shore and the Gaspe peninsula, as well 
as the two main cities of the province, Montreal and Quebec City 
(Figure 1 in Roy-Gagnon et al). In Saguenay, French Canadian
settlement started around 1840 with the arrival of inhabitants from
the neighboring region of Charlevoix. Between 1840 and 1910, 75% of
the 30 000 immigrants to Saguenay came from that region.



The region of the North Shore was mainly colonized by people 
from the Charlevoix and Bas-St-Laurent regions between 1840 and 
1920. On the other side of the St Lawrence River, in Gaspesia,
permanent European settlement began some decades earlier. In the
second half of the 18th century the Gaspe Peninsula first greeted
Acadians. Soon after, Loyalists joined them. Lastly, French Canadians
attracted by developing fishing, naval and lumber industries also
moved to Gaspesia. These three groups then evolved quite separately
as they married mostly among themselves.

During the 19th and 20th centuries, immigrants of various
origins mixed into the French Canadian population with a very
limited genetic impact, and it has been shown that early founders
have a greater contribution to the current gene pool. Today,
about 80% of the 8 million inhabitants of the province are French
speaking.

The availability of genealogical data is a major advantage for
genetic research in Quebec. Two important population registers exist: 
the BALSAC population register and the Early Quebec Population 
Register. The information contained in these databases comes 
primarily from vital statistics (births, marriages and deaths). As of 
November 2012, the BALSAC population register contained over 3 
million records, which have been computerized and linked to cover 
the whole province for the 19th and 20th centuries (mostly marriage 
records). The Early Quebec Population Register contains all records 
from the beginning of settlement (1608) to 1800 for a total of 700 000 
records. Using these population registers, it is possible to reconstruct
ascending genealogies of subjects from the present-day population 
going back over four centuries.

Figure 1 Distributions of genealogical characteristics. Histograms of genealogical characteristics calculated for each of the 7704 related pairs (ie, genealogical kinship >0). (a) Number of LCAs; (b) sum of LCAs' inbreeding coefficients (Fs) among pairs having at least one inbred LCA (n = 1034); (c) distance to nearest LCA; and (d) mean distance to LCAs.

In this study, we used extensive genealogical data from Quebec in
combination with genome-wide SNP data to first compare inference
of IBD sharing provided by different methods in order to best
describe genome-wide IBD sharing and its correlation with genealogical
characteristics. IBD sharing detection was performed on a
sample including seven populations of Quebec: French Canadians, 
Acadians and Loyalists from Gaspesia as well as French Canadians 
from Saguenay-Lac-St-Jean, North Shore, Quebec City and Montreal. 
Our analyses showed a good correlation between total length of IBD 
sharing and genealogical kinship coefficients for most methods with 
FastIBD yielding the best correlation overall. Using IBD results from
FastIBD, we found differences in genome-wide IBD sharing patterns
across sub-populations, which reflect genealogical characteristics. This 
information suggests that IBD sharing can reveal, at least in part, the 
complex relatedness present in a sample from a founder population 
like Quebec. 

MATERIAL AND METHODS 

Study population 

The data consist of 143 individuals from a previously reported sample from
seven sub-populations of Quebec. Recruitment criteria focused on the
geographical origin of participants, and as much as possible, we recruited
participants with at least one parent born in the region before 1960 or who
were themselves born in the region before 1960. For all individuals, using the
BALSAC population register and the Early Quebec Population Register, we
reconstructed genealogies as far back as possible and confirmed the
absence of closely related individuals (first cousins and closer) in the sample.
All participants gave their informed consent and the CHU Sainte-Justine Ethics
Committee approved the study protocol.

For comparison purposes, we downloaded the original CEU sample (II +
III) from the International HapMap project. We excluded two highly related
individuals, leading to a set of 109 individuals (more distantly related than
cousins) with North-Western European origin (see Supplementary Table S1).

Genotyping and quality control 

The sample from Quebec was genotyped on Illumina HumanHap650Y arrays at the
McGill University and Genome Quebec Innovation Center. Quality control
procedures were the same as in the first publication using this sample. Briefly,
a quality check was performed to retain individuals and SNPs with at least 90%
of genotypes called and to select only common autosomal SNPs (MAF > 5%) in
Hardy-Weinberg equilibrium (exact test, P > 0.001). These restrictions
yielded 140 individuals (20 Gaspesian French Canadians, 20 Acadians,
20 Loyalists, 22 from Saguenay-Lac-St-Jean, 20 from the North Shore,
16 from Quebec City and 22 from Montreal) and 539 742 SNPs. The same
quality control criteria were applied to HapMap CEU, yielding 538 776 SNPs.
All genomic positions are according to NCBI build 37.

Genealogical data and associated measures 

The completeness of the genealogical data is measured by the number of
ancestors observed (i.e., ancestors for whom information is available) in the
data at a given generation divided by the expected number of ancestors. The
completeness of the genealogical data of our sample of 140 individuals is over
90% up to the 5th generation and over 80% up to the 9th generation, except
for the Gaspesian Loyalists (see Roy-Gagnon et al for a more detailed
description of completeness in these data). The lower amount of genealogical 
information available for the Loyalists sample is mainly due to their later 
arrival in Quebec and, to a lesser extent, to the fact that Protestant records were 
less complete and less well kept than Catholic records (which cover French 
Canadians and Acadians). 
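
The completeness measure defined above can be illustrated with a minimal sketch. It assumes, for illustration only, that a pedigree is stored as a dictionary mapping each individual to a (father, mother) tuple, with None for an unknown parent; the function name and the toy pedigree are hypothetical and are not part of the BALSAC or GenLib tools.

```python
def completeness(pedigree, proband, max_generation):
    """Return {generation: observed ancestors / expected ancestors}.

    pedigree maps an individual ID to a (father, mother) tuple; None marks
    an unknown parent. This layout is an assumption made for illustration.
    """
    result = {}
    current = [proband]  # ancestors found at the previous generation
    for g in range(1, max_generation + 1):
        found = []
        for person in current:
            for parent in pedigree.get(person, (None, None)):
                if parent is not None:
                    found.append(parent)
        # 2**g ancestors are expected at generation g
        result[g] = len(found) / 2 ** g
        current = found
    return result


# Toy pedigree: the maternal grandmother is unknown and nothing is
# recorded beyond the grandparents.
toy = {
    "p": ("f", "m"),
    "f": ("ff", "fm"),
    "m": ("mf", None),
}
print(completeness(toy, "p", 3))
```

An ancestor reached through several lines is counted once per line, so that the denominator 2^g remains the appropriate expected count.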

To describe the sample, kinship and inbreeding coefficients were calculated 
using the S-Plus 8.0 (S-PLUS 8.0. Copyright 1988, 2007 Insightful Corp)
function library GenLib. This library implements the algorithm of Karigl to
calculate kinship coefficients. We also used PedHunter software to get the
set of lowest common ancestors (LCAs) for each pair of individuals. LCAs are
the most recent ancestors shared by a pair of individuals. A pair can have more
than one LCA as long as no ancestor in the set of LCAs shares a descendant 
who is also an ancestor of the pair of individuals. We also obtained, using 
PedHunter, the length of the shortest paths from one member of the pair to 
the other member through their LCAs, named hereafter distances to LCAs. 
Once each set of LCAs was obtained, we calculated the inbreeding coefficients 
of these LCAs. We used the sum of these inbreeding coefficients to measure the 
total amount of inbreeding present among the LCAs. 
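
For readers unfamiliar with these measures, the following sketch computes kinship and inbreeding coefficients from a small pedigree using the textbook recursive definition of kinship. It is not the GenLib implementation of Karigl's algorithm; the pedigree layout ({child: (father, mother)}), the function name make_kinship and the toy individuals are assumptions made for illustration.

```python
from functools import lru_cache


def make_kinship(pedigree):
    """Build a kinship function for a pedigree {child: (father, mother)}."""

    def parents(x):
        return pedigree.get(x, (None, None))

    @lru_cache(maxsize=None)
    def depth(x):
        # Longest chain of known ancestors above x (founders have depth 0).
        known = [p for p in parents(x) if p is not None]
        return 1 + max(depth(p) for p in known) if known else 0

    @lru_cache(maxsize=None)
    def phi(a, b):
        if a is None or b is None:
            return 0.0  # unknown parents are treated as unrelated founders
        if a == b:
            father, mother = parents(a)
            return 0.5 * (1.0 + phi(father, mother))
        # Always expand the deeper individual, which cannot be an
        # ancestor of the other one.
        if depth(a) < depth(b):
            a, b = b, a
        father, mother = parents(a)
        return 0.5 * (phi(father, b) + phi(mother, b))

    return phi


# Toy pedigree: c1 and c2 are first cousins; kid is their child.
pedigree = {
    "p1": ("gf", "gm"),
    "p2": ("gf", "gm"),
    "c1": ("p1", "s1"),
    "c2": ("s2", "p2"),
    "kid": ("c1", "c2"),
}
phi = make_kinship(pedigree)
kinship_cousins = phi("c1", "c2")
inbreeding_kid = phi(*pedigree["kid"])  # F of a child = kinship of its parents
print(kinship_cousins, inbreeding_kid)
```

The inbreeding coefficient of an individual is the kinship coefficient of its parents, which is why the child of first cousins has F = 1/16.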

Genomic IBD sharing 

We selected five different methods to perform the detection of IBD segments, 
all using a probabilistic framework except GERMLINE. GERMLINE is
computationally efficient software implementing a method that builds a
dictionary of haplotypes to find matches between individuals. These matches
are then extended to identify long shared segments, while allowing some
flexibility by assuming an error rate per SNP in order to avoid too many false
negatives caused by genotyping inaccuracies. Other methods are largely based
on hidden Markov models (HMM). PLINK is the simplest method as it does 
not allow genotyping error and assumes that SNPs are in approximate linkage 
equilibrium. IBDLD incorporates potential genotyping errors and missing
data and has an extension for LD. The FastIBD method also includes an LD
model when estimating IBD. The inference is conducted on sampled
haplotypes for which an IBD score is calculated using shared haplotype
frequency. Detected tracts are then extended and identified as being IBD
according to a threshold set on score values. The last method that we
considered, SLRP, also uses an HMM to approximate the IBD process while
considering a genotyping error rate.
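
The dictionary-based idea described for GERMLINE can be conveyed with a toy sketch. This is not GERMLINE's code: it is a simplified illustration that assumes phased haplotypes coded as strings of 0/1 alleles, uses exact matching in fixed windows, and ignores the per-SNP error tolerance and genetic-distance thresholds of the real software. All names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def seed_matches(haplotypes, window):
    """Return {(id1, id2): [indices of windows where both are identical]}."""
    n_markers = len(next(iter(haplotypes.values())))
    matches = defaultdict(list)
    for w, start in enumerate(range(0, n_markers - window + 1, window)):
        table = defaultdict(list)  # dictionary of haplotype chunks
        for hap_id, alleles in haplotypes.items():
            table[alleles[start:start + window]].append(hap_id)
        for ids in table.values():
            for pair in combinations(sorted(ids), 2):
                matches[pair].append(w)
    return matches


def merge_runs(windows):
    """Merge consecutive window indices into (first, last) runs."""
    runs = []
    for w in windows:
        if runs and w == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], w)  # extend the current run
        else:
            runs.append((w, w))  # start a new run
    return runs


# Toy haplotypes (10 markers each), split into windows of 2 markers.
haps = {
    "A": "0101100111",
    "B": "0101100110",
    "C": "1110000111",
}
seeds = seed_matches(haps, window=2)
runs_ab = merge_runs(seeds[("A", "B")])
runs_ac = merge_runs(seeds[("A", "C")])
print(runs_ab, runs_ac)
```

Hashing each window once keeps the cost of finding seeds close to linear in the number of haplotypes, instead of comparing every pair at every marker; long runs of matching windows are then the candidates for IBD segments.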

For all methods default parameters were used and some data manipulations 
were performed when necessary (Supplementary Table S2). For PLINK, which 
does not include LD, we did SNP pruning (pairwise r² < 0.2 in sliding windows
of size 50 shifting every 5 SNPs) leading to a subset of 65 959 SNPs. For
GERMLINE, we phased data with two different methods: Beagle version 3.3.1
and ShapeIT version 1.378. For all analyses, we kept only segments greater
than or equal to 2 cM, corresponding to the expected length of segments for
common ancestors up to 25 generations ago. This length ensures a good
sensitivity and limits the false-discovery rate.
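
The link between the 2 cM threshold and 25 generations can be checked with a short sketch. It relies on the usual approximation that two individuals whose common ancestor lived g generations ago are separated by 2g meioses, so that IBD segment lengths have an expected value of 100/(2g) cM. The representation of segments as (start, end) positions in cM and the function names are assumptions made for illustration.

```python
def expected_segment_length_cm(generations_to_ancestor):
    """Expected IBD segment length (cM) for an ancestor g generations ago."""
    meioses = 2 * generations_to_ancestor  # g meioses on each lineage
    return 100.0 / meioses


def total_ibd_length(segments, min_length_cm=2.0):
    """Sum the lengths of segments that reach the length threshold.

    segments is a list of (start_cM, end_cM) tuples for one pair.
    """
    return sum(end - start for start, end in segments
               if end - start >= min_length_cm)


print(expected_segment_length_cm(25))
# Toy segments for one pair: the 1.5 cM segment is discarded.
toy_segments = [(0.0, 1.5), (10.0, 12.0), (30.0, 37.5)]
print(total_ibd_length(toy_segments))
```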

Statistical analysis 

We first examined the correlation between IBD sharing identified with the 
different methods and genealogical kinship coefficients. We used the total 
length of all segments shared IBD and calculated Pearson's correlation 
coefficients. Assuming that genealogical kinship is the true expected kinship, 
we selected the method providing the best correlation as the best method for 
our population and retained this method for further analyses. We also 
examined the distribution of the lengths of the IBD segments identified by 
each method and we considered computation time. 
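
The selection step described above can be sketched as follows. The kinship values and total IBD lengths are invented toy numbers, and the method labels are placeholders; only the procedure (Pearson's correlation per method, then retaining the method with the highest value) reflects the text.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy data: one genealogical kinship coefficient per pair, and the total
# IBD length (cM) reported for the same pairs by two hypothetical methods.
kinship = [0.0, 0.001, 0.002, 0.004]
total_ibd = {
    "method_a": [1.0, 3.0, 5.0, 9.0],
    "method_b": [5.0, 1.0, 9.0, 3.0],
}
correlations = {m: pearson(kinship, v) for m, v in total_ibd.items()}
best_method = max(correlations, key=correlations.get)
print(correlations, best_method)
```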

We then examined the relationships between genomic IBD sharing and 
genealogical characteristics using simple linear regression models. We also 
looked at genomic sharing in pairs of individuals with or without at least one 
inbred LCA. Lastly, we investigated differences in IBD among the
sub-populations. We plotted the average number of segments of a certain size
shared per pair of individuals and also the proportion of pairs of individuals 
having IBD sharing at each position on the genome. 
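
As an illustration of the last summary, the proportion of pairs sharing IBD at a given position can be computed as in the sketch below. It assumes segments on a single chromosome stored per pair as (start, end) coordinates; the data layout, function name and toy values are hypothetical.

```python
def sharing_proportion(segments_by_pair, n_pairs, positions):
    """Proportion of pairs with an IBD segment covering each position.

    segments_by_pair maps a pair to its list of (start, end) segments on
    one chromosome; pairs without any segment may be absent from the
    dictionary, which is why n_pairs is passed separately.
    """
    proportions = []
    for pos in positions:
        covered = sum(
            1 for segs in segments_by_pair.values()
            if any(start <= pos <= end for start, end in segs)
        )
        proportions.append(covered / n_pairs)
    return proportions


# Toy example: 4 pairs in total, 3 of which have one segment each.
toy_pairs = {
    ("a", "b"): [(10, 20)],
    ("a", "c"): [(15, 30)],
    ("b", "c"): [(40, 50)],
}
print(sharing_proportion(toy_pairs, n_pairs=4, positions=[5, 18, 45]))
```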

RESULTS 

Genealogical description 

Levels of relatedness among individuals within the different
sub-populations, as measured by the kinship coefficients estimated from
the genealogical data, vary greatly (Supplementary Figure S1). As
described in Roy-Gagnon et al, people from Saguenay and North
Shore as well as Acadians had higher levels of kinship, while
populations from Montreal and Quebec City areas were less related.
These observations are consistent with the settlement history of the
province of Quebec and are also supported by previous findings based
on genealogical data that emphasized a West-East decreasing gradient
of diversity among regional populations as well as a stratification of
regional populations.

Figure 1a presents the distribution of the number of LCAs per pair
of individuals, excluding pairs that are unrelated according to the
genealogical kinship coefficients (ie, pairs of individuals with kinship = 0).
Pairs of individuals with a kinship value equal to zero are either unrelated
relative to the time scale considered or related but without enough
genealogical information available to support the relationship. In the
whole sample, the average number of LCAs per pair of individuals
was 74, ranging from 2 to 433. Average numbers of LCAs for the
sub-populations ranged from 2.3 (Loyalists) to 152.6 (Montreal area), and
the distributions were significantly different among sub-populations
(all Kolmogorov-Smirnov test P-values < 0.007). For each related pair,
we also looked at the distance to the most recent LCA and the mean
distance to LCAs (Figure 1c and d), which were on average 15.5
(ranging from 5 to 24) and 19.8 (ranging from 7 to 24), respectively.
These distributions were also significantly different across populations
(P-values < 0.02) except for minimal distance to LCA for Loyalists
compared with Acadians and Gaspesian French Canadians compared
with North Shore.

We also described inbreeding among LCAs. Only 13% of pairs of
related individuals had one inbred LCA or more but this percentage
varied greatly from one population to another. The proportion of
pairs with at least one inbred LCA was more than half for the
Saguenay, North Shore and Acadian populations, 31% for Gaspesian
French Canadians and < 8% for the other populations. The number
of inbred LCAs for a pair of individuals with LCAs ranged from 0 to 13
and inbreeding coefficients ranged from 0.00006 to 0.06, which are
approximately equivalent to individuals with parents that are
seventh-degree relatives and first cousins, respectively. Figure 1b shows the
sum of all LCAs' inbreeding coefficients, which is the measure that we
chose to summarize the inbreeding information. This sum ranged
from 0.0001 to 0.2 for pairs of individuals with at least one inbred LCA.
Overall, distributions of genealogical characteristics reflect the
diversity and complexity of the relationships present in the structured
founder population of Quebec.

Comparison of different IBD sharing detection methods 

Before comparing results from selected methods, we looked at results
from the only method using phased data, GERMLINE, for which we
used two different phasing methods (ShapeIT and Beagle). Haplotypes
obtained with different phasing methods are not consistent and
this might impact IBD inference. Indeed, data phased with ShapeIT
provided IBD results that were more strongly correlated with the
genealogical information than those phased with Beagle. The correlation
between total length of IBD segments and genealogical kinship
coefficients for results from GERMLINE was 0.92 for genotypes phased
with ShapeIT and 0.72 for genotypes phased with Beagle. Hence, we
retained GERMLINE's results with ShapeIT phasing for further
analyses.

In the whole sample from the Province of Quebec, we observed
Pearson's correlation coefficients ranging from 0.69 to 0.92 for the total
length of IBD sharing identified with the different methods against
the genealogical kinship coefficient (Table 1, Supplementary Figure S2).
Three methods (GERMLINE, FastIBD and IBDLD) stand out with
correlation coefficients of 0.92. IBD sharing identified by PLINK and
SLRP was less concordant with genealogical information.

To get a better idea of which method provided the most 
appropriate results for our data, we further examined the correlation 
between IBD sharing and kinship in each sub-population separately. 
Correlations varied across populations (Table 1). We noted the low
correlation of IBD sharing inferred by some methods in the Saguenay
region with kinship coefficient despite the presence of a noteworthy
degree of relatedness among individuals in this region. Less surprisingly,
populations with lower expected relatedness, such as Montreal
and Quebec City areas, had lower correlations with values ranging
from -0.02 to 0.46. We also noted that correlations found for the
Loyalists were either very good (0.84-0.86) or very weak (-0.03 or
0.01).

Table 1 Pearson's correlation coefficients between total length of
IBD sharing and kinship coefficients for each population and each
method

                         Methods
Population   PLINK   GERMLINE   FastIBD   IBDLD    SLRP
ACA           0.87       0.88      0.89    0.89    0.85
GFC           0.92       0.91      0.92    0.92    0.88
LOY          -0.03       0.84      0.86    0.86    0.01
MON           0.10       0.39      0.46    0.45   -0.02
NS            0.85       0.88      0.90    0.88    0.83
QUE           0.09       0.31      0.45    0.42    0.15
SAG           0.15       0.82      0.84    0.83    0.13
PQ            0.77       0.92      0.92    0.92    0.69

Abbreviations: ACA, Acadians; GFC, Gaspesian French Canadians; LOY, Loyalists; NS, North
Shore; MON, Montreal; QUE, Quebec City area; SAG, Saguenay; PQ, whole sample from the
Province of Quebec.

Results from FastIBD were retained for further analyses. Assuming
that genealogical kinship is the true expected kinship, FastIBD was
among the fastest (see Supplementary Table S3 for detailed information
on computation time) and best reflected the relatedness
described by our genealogical data, as evaluated by the correlation
between total length of IBD sharing and genealogical kinship
coefficient.

Genealogical measures versus inferred IBD sharing 

Before looking at the relationship between IBD sharing and different 
genealogical variables, we examined the impact of genealogical 
completeness on the correlation between total length of IBD sharing 
and the genealogical kinship coefficient. We recalculated the correlation
coefficients with pairs of individuals having >50% of their
genealogical information complete at the 5th generation and also with
the same completeness at the 10th generation. Almost no change was
observed at the 5th generation, while at the 10th changes in
correlation coefficients were small (0-0.10, all within one s.d. of the
estimates) except for the Loyalists, who did not have enough complete
pairs at the 10th generation to recalculate the correlation. We chose to 
keep all pairs in our sample. 

As IBD sharing was highly correlated with genealogical kinship 
coefficient, a simple linear regression fits the data well. Hence, the 
overall degree of relatedness is well captured by overall IBD sharing 
with 85% of the variance in kinship coefficients explained by total 
length of IBD sharing (Figure 2a). Total length of IBD sharing also 
reflected characteristics of relatedness, such as shorter distance to LCA 
or having an inbred ancestor. Total length of IBD sharing explained 
26% of the variance in the mean distance to LCAs, 39% of the 
variance in the distance to the nearest LCA and 31% of the variance in 
the sum of LCAs' inbreeding coefficients (Figure 2b-d). As pairs of 
individuals sharing an inbred common ancestor seemed to be a 
distinct group we separated the whole sample based on this criterion 
to assess the impact on IBD sharing. Comparing the two groups
obtained, we observed significantly more IBD sharing for pairs having
at least one inbred LCA (Supplementary Figure S3). These pairs have,
on average, 7.2 times more total length of IBD sharing and 4.5 times
more IBD segments.

Figure 2 IBD sharing and genealogical characteristics. Scatter plots of total length of IBD sharing versus genealogical characteristics for each pair
(n = 9730): (a) kinship coefficients; (b) distance to nearest LCA; (c) sum of LCAs' inbreeding coefficients (Fs); and (d) mean distance to LCAs. A simple
linear regression line is plotted in gray on each graph.
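
The proportion of variance explained by a simple linear regression equals the squared Pearson correlation, which is how a correlation of 0.92 corresponds to about 85% of variance explained. The sketch below verifies this identity on invented toy values; the numbers and the function name are hypothetical and only the arithmetic reflects the results reported here.

```python
import math


def fit_simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r, r_squared), where r is Pearson's
    correlation and r_squared is 1 - SS_residual / SS_total.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r, 1.0 - ss_res / syy


# Toy values: total IBD length (cM) and genealogical kinship for 5 pairs.
toy_ibd = [10.0, 20.0, 30.0, 40.0, 50.0]
toy_kinship = [0.0010, 0.0025, 0.0028, 0.0045, 0.0049]
slope, intercept, r, r2 = fit_simple_regression(toy_ibd, toy_kinship)
print(r, r2)
print(round(0.92 ** 2, 2))  # share of variance implied by r = 0.92
```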

IBD sharing in populations 

The amount of IBD sharing per population is shown in Figure 3.
Each dot represents the mean number of segments shared per pair for
a specific length ranging from 2 to 15 cM. The number of segments and
their length vary with the degree of relationships, yielding distinct
curves for the different levels of kinship present in the populations.
The Acadians, who have the highest levels of kinship, have a curve
well above the other populations. The Saguenay and North Shore
curves overlap, reflecting similar kinship levels in these two populations.
Montreal and Quebec City show lower and more variable levels
of IBD sharing. We also observed a clear difference between our whole
sample from Quebec and the HapMap CEU sample. On average, pairs
of individuals from Quebec shared 3.8 IBD segments and had 21.3 cM
of IBD sharing, while those from HapMap CEU shared 2.7 IBD
segments and had 8.0 cM of IBD sharing. Thus, pairs of individuals
from Quebec also shared longer segments, with segments smaller than
5 cM representing 63 and 96% of segments for Quebec and HapMap
CEU, respectively.
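
The per-class summary plotted in Figure 3 can be sketched as below, assuming that each segment is assigned to the 1-cM class given by the integer part of its length and that the count in each class is divided by the number of pairs in the population. The function name and toy lengths are hypothetical.

```python
import math
from collections import Counter


def mean_segments_per_class(segment_lengths_cm, n_pairs, lo=2, hi=15):
    """Mean number of segments per pair in each 1-cM length class.

    Class c holds segments with c <= length < c + 1, for c from lo to
    hi - 1. Segments outside [lo, hi) are ignored.
    """
    counts = Counter(
        math.floor(length) for length in segment_lengths_cm
        if lo <= length < hi
    )
    return {c: counts.get(c, 0) / n_pairs for c in range(lo, hi)}


# Toy example: segment lengths pooled over the 4 pairs of a population.
toy_lengths = [2.1, 2.9, 3.5, 14.2, 20.0, 1.0]
per_class = mean_segments_per_class(toy_lengths, n_pairs=4)
print(per_class[2], per_class[3], per_class[14])
```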

Whole-genome IBD sharing 

Figure 4 shows the proportions of pairs of individuals having IBD
sharing at specific chromosomal positions across the whole genome.
Patterns across populations are different and, as in Figure 3, we can
see that average sharing differs among populations, with the CEU
sharing less than the Quebec population. Some IBD sharing seems
consistent across populations, for example, around the HLA region
on chromosome 6 where a peak can be observed for the whole
Quebec sample and CEU sample.

Figure 3 Pairwise IBD sharing in each population. The mean number of
segments shared per pair (y axis, log-scale) is shown for each 1-cM length
class ranging from 2 to 15 cM. ACA, Acadians; GFC, Gaspesian French
Canadians; LOY, Loyalists; NS, North Shore; MON, Montreal; QUE, Quebec
City area; SAG, Saguenay; PQ, whole sample from the Province of Quebec;
CEU, HapMap CEU.

DISCUSSION 

In this study, we first compared IBD inference provided by five 
different methods by correlating total length of IBD sharing with
genealogical kinship. To our knowledge, our study is the first to
provide a comparison of the performance of different methods in a 
complex data set from a founder population. It is difficult to evaluate 
the performance of methods in a real-data setting as we do not know 
the truth. Because of the availability of extended genealogies in our 
population, we could evaluate, at least in part, the performance of 
IBD detection methods by comparing them with genealogical 
information. Our results confirmed the importance of a well-defined 
and flexible model or algorithm for IBD inference and identified 
FastIBD, GERMLINE and IBDLD as the best-performing methods
based on the high correlations between total length of IBD segments
shared by pairs of individuals and their genealogical kinship
coefficient. As noted in previous studies, simple models that do not
consider genotyping errors and LD, such as that implemented in
PLINK, yield lower resolution of IBD detection. In our sample, the
smallest segment detected with PLINK was 3.8 cM long, almost twice
our threshold of 2 cM. With respect to genotyping errors, a
modification of PLINK has been proposed in order to include
genotyping confidence scores into the IBD inference process, which
could improve IBD inference. SLRP has previously been shown to
yield a high accuracy of IBD detection compared with GERMLINE
and FastIBD in simulated data. However, in our population, SLRP
identified more IBD sharing than the other methods, while yielding
the lowest correlation coefficients with genealogical kinship overall.

We selected FastIBD for further analyses, because IBD sharing
within populations was more strongly associated with genealogical
information with this method and it was fast to run. We recognize that we did
not optimize the parameters selected for each method but simply
used the ones recommended by the authors for their methods.
Parameter optimization could have affected our comparison and
improved our results. However, our results are consistent with most
simulations reported in the literature comparing different methods.
We also restricted our study to five methods as other existing methods
were more difficult to use or not implemented in software.

Using results from FastIBD, we then related IBD sharing to the
different genealogical measures. Total length of IBD sharing explained
a large portion of the variance in kinship coefficients. Our results
highlight the variability in realized IBD sharing for a variety of pairs 
of remotely related individuals with known kinship. Not surprisingly, 
total length of IBD sharing also explained more of the variance in the 
distance to nearest LCAs than of the variance in mean distance to 
LCA as the most recent ancestors have a higher impact on IBD 
sharing. We aggregated inbreeding coefficients from LCAs into a 
unique sum and found that IBD sharing also explained a noteworthy 
part of its variance. However, we are conscious that a pair of 
individuals could have no inbred LCA identified but still share a 
more distant inbred ancestor. This occurred for only 20 pairs of 
individuals. Some shared inbred LCAs may also not be identified 
because of lower genealogical completeness. This might explain a few 
pairs of individuals without inbred LCAs (shown as outliers on 
Supplementary Figure S3) that had an important amount of IBD 
sharing compared with their group average and that had inbred 
ancestors that were not shared according to the information available. 

Length of segments identified in the different populations was also
a good way to identify population differences. The odds of sharing
more segments as well as longer segments were higher in populations
with more relatedness, as expected. Furthermore, IBD sharing in the
whole sample was very high and, as expected, mean length of
segments inferred (data not shown) was higher than in any other
HapMap or Ashkenazi Jewish populations considered in Gusev et al
except for one sample in which many pairs were closely related (closer
than first cousins according to IBD inference). The high IBD
sharing and increased proportion of longer segments are explained by
the founder events that occurred and the population expansions that
followed them. The fact that the Saguenay population size underwent a
25-fold increase in only a century, from 1861 to 1961, while the whole
Quebec population increased about five times, is consistent
with our results.

Despite important differences in mean proportion of IBD sharing
between Quebec and HapMap CEU, we noted the presence of a
common peak of IBD on chromosome 6 covering the HLA region.
Increased IBD sharing has been reported for this region in several
populations and could be the result of selection. As pointed out by
Browning and Browning, IBD inference will be facilitated in regions
with high LD but LD may also lead to overestimating the true IBD
sharing relative to recent common ancestors. Knowing that important
LD normally arises in the presence of natural selection, the relevance of
an excess of IBD sharing in the HLA region should be investigated
more deeply.

IBD detection methods are useful in many contexts such as
identifying phasing errors or polymorphic deletions, estimating
heritability, inferring kinship and mapping diseases in
association studies. In cases where genealogical information is
not available, we now know that IBD is an alternative to account for
unknown relatedness. Even in samples that are widely used such as
those coming from CEPH, precautions are necessary as important
consanguinity has been identified recently. Our study is an additional
example putting forward the importance of considering relatedness in
a sample before studying it. The high correlation that we observed
between genealogical information and IBD sharing, over the wide
range of remote relatedness present in our study population, further
demonstrates the usefulness of genomic IBD detection to capture
even complex relatedness involving inbreeding, and our findings can
guide the interpretation of results in other populations without
genealogical data. Our study highlights the great variety in types of
relatedness present in the French Canadian founder population and
how this complex relatedness is reflected in patterns of IBD sharing.



Using these patterns, it is thus possible to gain insight on the types of 
distant relatedness in a sample from a founder population like 
Quebec, leading to better genetic study design and analysis.